FINAL REPORT:

COMPUTERIZED  ADAPTIVE  ABILITY  TESTING 


This research program was designed to study four areas relevant to adaptive
ability testing:

1. Evaluation of adaptive testing branching strategies.

2. Use of different item formats and response modes.

3. Fit of individuals to item response theory models.

4. New types of ability tests designed specifically for
   computerized adaptive administration.

Research in pursuance of these objectives, originally scheduled for a
three-year period, began on April 1, 1979, and continued through June 30, 1980,
at which time the research objectives of this project were combined with those
of Project NR 150-433, "Computerized Adaptive Achievement Testing."

This report summarizes the progress made during this 15-month period. No
technical reports were completed during this period; technical reports begun in
this project will be completed under Project NR 150-433. For each of the four
objectives listed above, this report (1) describes the objective, (2) details
the approaches used to study the objective, (3) summarizes results that were
available at the completion of the reporting period, and (4) describes tentative
plans for further research on the objective, to be continued in Project NR
150-433.


Adaptive  Testing  Strategies 

Previous simulation studies of adaptive testing have used relatively
unrealistic item pools (e.g., Gorman, 1980; McBride, 1976; Reckase, 1976; Urry,
1970; Urry, 1971; Vale, 1975). These item pools have been unrealistic because
they assumed that the item parameters describing the items in the pool were
completely error free, and because they assumed item difficulty and distribution
characteristics that did not reflect those of real ability tests. Previous
studies have also been unrealistic in that they assumed that the responses of
the hypothetical testees to these items conformed precisely to the
one-dimensional latent trait model. The results of previous simulation studies
may therefore not generalize to real item pools, because real item pools are
constructed from item parameters that include estimation error and may deviate
substantially from unidimensionality.

Objective 

The objectives of this research program were to evaluate the performance of
adaptive testing strategies under conditions that more reasonably represent the
conditions under which these strategies might be used in live-testing
applications and to compare findings from selected simulation studies to those
obtained in live testing. Research during the reporting period was concerned
with (1) effects of errors in item parameters on the performance of adaptive
testing strategies and (2) live-testing comparisons of adaptive testing
strategies.

Effects of Errors in Item Parameter Estimates
on Adaptive Testing Strategies

Approach. This objective was pursued by means of Monte Carlo simulation
studies that built upon empirical information regarding the nature and extent of
errors in item parameter estimates due to the numbers of testees and items on
which the item response theory (IRT) item parameters were estimated. Data on
the kinds and degrees of error associated with different IRT item
parameterization methods were modeled in the Monte Carlo simulations. The kinds
and degrees of error observed in real-data item parameterization were
translated into the Monte Carlo simulation model and served as independent
variables in a series of studies that systematically varied the magnitude and
kind of item parameter estimation error for the difficulty, discrimination, and
"guessing" parameters, separately and in combination. Dependent variables in
these studies were test information, bias, correlation of ability estimates
with true ability, and other characteristics of the ability estimates derived
from the application of selected adaptive testing strategies; two conventional
tests were also included in the study for comparison purposes. The studies were
also designed to use an item pool that realistically reflected the composition
of real item pools used in actual ability tests, in terms of the distributions
of the IRT item parameter estimates.

Figure 1 summarizes the design of this study. Using a three-parameter IRT
model and an item pool designed to reflect an adaptive testing item pool that
had been used in a live-testing study, Monte Carlo data were generated for 100
simulees at each of 17 levels of ability, ranging from θ = -3.2 to θ = +3.2.
Based on data available in the IRT item parameterization literature, varying
degrees of error were added to the parameter estimates for item discrimination
(a), difficulty (b), and "guessing" (c). Table 1 shows the item parameter sets
used in this study: Set 1 was the baseline comparison data set in which there
was no error in the item parameter estimates; in Sets 11, 12, and 13 varying
amounts of error were added to the a parameter; in Sets 21, 22, 23, and 24 error
was added to the b parameter; Sets 31 and 32 added errors to the c parameter;
and in Sets 41 and 42 errors occurred in all three parameters simultaneously.
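The data-generation step described above can be sketched in a few lines of
Python. This is only an illustrative sketch, not the program used in the study:
the 100-item pool, the parameter ranges, the scaling constant D = 1.7, the use
of independent normal errors, and the clipping bounds are all assumptions
introduced here. The error standard deviations are loosely patterned on the
specified RMSE values listed for Set 41 in Table 1.

```python
import math
import random

D = 1.7  # logistic scaling constant (an assumption; not stated in the report)

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def add_error(values, sd, rng, lo, hi):
    """Add independent normal error with standard deviation sd, clipped to [lo, hi]."""
    return [min(hi, max(lo, v + rng.gauss(0.0, sd))) for v in values]

def rmse(true_vals, noisy_vals):
    """Root mean square difference between true and error-laden parameters."""
    n = len(true_vals)
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true_vals, noisy_vals)) / n)

rng = random.Random(1979)

# Hypothetical error-free item pool (sizes and ranges are illustrative only).
n_items = 100
a_true = [rng.uniform(0.5, 2.0) for _ in range(n_items)]
b_true = [rng.uniform(-3.0, 3.0) for _ in range(n_items)]
c_true = [rng.uniform(0.10, 0.30) for _ in range(n_items)]

# Error-laden estimates; the adaptive test would select and score with these.
a_est = add_error(a_true, 0.60, rng, 0.1, 3.0)
b_est = add_error(b_true, 0.30, rng, -4.0, 4.0)
c_est = add_error(c_true, 0.08, rng, 0.0, 0.5)

# 17 ability levels from -3.2 to +3.2, with 100 simulees at each level.
thetas = [round(-3.2 + 0.4 * k, 1) for k in range(17)]
simulees = [t for t in thetas for _ in range(100)]

# Responses are generated from the TRUE parameters, one row per simulee.
responses = [
    [1 if rng.random() < p_correct(t, a, b, c) else 0
     for a, b, c in zip(a_true, b_true, c_true)]
    for t in simulees
]

print("obtained RMSE (a, b, c):",
      round(rmse(a_true, a_est), 2),
      round(rmse(b_true, b_est), 2),
      round(rmse(c_true, c_est), 2))
```

In this sketch the responses always come from the error-free parameters, while
the error-laden copies stand in for what a testing strategy would actually
"see"; clipping the noisy values to plausible bounds is one design choice among
several and can make the obtained RMSE differ from the specified value.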

Using the error-laden item parameter sets, three types of adaptive tests
(stratified adaptive, or stradaptive; maximum information; and Bayesian) were
administered to each of the 1,700 simulees. All tests were scored by maximum
likelihood at test lengths of 5 to 30 items, in increments of 5 items. In
addition, both peaked and rectangular conventional tests were constructed using
classical test construction procedures; and these tests, along with the adaptive
tests using the error-free item pool, were also administered to the 1,700
simulees. Testing strategies were compared in terms of fidelity (the
correlation of true and estimated θ levels), observed and theoretical
information, efficiency, inaccuracy, bias, and root mean square error (RMSE)
for the θ estimates.
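Several of the criterion measures named above can be written compactly. The
following Python sketch is illustrative only: fidelity and RMSE follow the
definitions given in the text, bias is taken to be the mean signed error, and
inaccuracy is assumed here to be the mean absolute error, a definition the
report itself does not spell out. The function names and the small data set are
hypothetical.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def criterion_measures(theta_true, theta_hat):
    """Summarize how well ability estimates recover the true ability values."""
    errors = [h - t for t, h in zip(theta_true, theta_hat)]
    n = len(errors)
    return {
        "fidelity": pearson(theta_true, theta_hat),        # r(true, estimated)
        "bias": sum(errors) / n,                           # mean signed error
        "inaccuracy": sum(abs(e) for e in errors) / n,     # assumed: mean |error|
        "rmse": math.sqrt(sum(e * e for e in errors) / n), # root mean square error
    }

# Tiny made-up example: four simulees and their ability estimates.
theta_true = [-1.0, 0.0, 1.0, 2.0]
theta_hat = [-0.5, 0.0, 1.5, 2.0]
measures = criterion_measures(theta_true, theta_hat)
print(measures)
```

Computing the same measures separately within each θ level, rather than over
all simulees, gives the conditional versions of these indices.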

Results. Table 2 presents a selection of the results for four of the
dependent variables. The fidelity measure was computed on a normally distributed
sample of 300 simulees; data for the other criterion measures were averaged
across the 1,700 simulees. As Table 2 shows, with the exception of Item Set 42,


Figure  1 

Design  of  Monte  Carlo  Simulation  Study  of  Effects  of 
Errors  in  Item  Parameter  Estimates  on  Adaptive  Testing  Strategies 
Table 1

Error Simulated in Item Parameter Estimate Sets

Item                              Specified RMSE     Obtained RMSE       Obtained r
Set   Description                  a     b     c      a     b     c      a     b     c

  1   Error-Free Item Set         .00   .00   .00    .00   .00   .00   1.00  1.00  1.00
 11   Small Error in a            .20   .00   .00    .22   .00   .00    .75  1.00  1.00
 12   Moderate Error in a         .40   .00   .00    .39   .00   .00    .56  1.00  1.00
 13   Large Error in a            .60   .00   .00    .52   .00   .00    .37  1.00  1.00
 21   Moderate Error in b         .00    --   .00    .00   .09   .00   1.00   .99  1.00
 22   Large Error in b            .00   .30   .00    .00   .30   .00   1.00   .98  1.00
 23   Extreme Error in b          .00  1.00   .00    .00   .89   .00   1.00    --  1.00
 24   Very Large Error in b       .00   .50   .00    .00   .48   .00   1.00   .96  1.00
 31   Moderate Error in c         .00   .00   .04    .00   .00   .04   1.00  1.00   .70
 32   Large Error in c            .00   .00   .08    .00   .00   .08   1.00  1.00   .46
 41   Worst Probable
      Combined Error              .60   .30   .08    .51   .32   .08    .47   .98   .46
 42   Extreme
      Combined Error              .60   .00   .08    .58   .97    --    .44   .88   .43

Note. -- marks an entry that is illegible in the source copy.

which represented extreme (and probably unrealistic) levels of error in all
three item parameters, the adaptive tests with error-laden item parameters
achieved higher fidelities at all test lengths than did the peaked (P) and
rectangular (R) conventional tests, with larger differences occurring for
shorter test lengths. There were virtually no differences in fidelities for the
adaptive strategies at 20- or 30-item test lengths, with a tendency for the
maximum information (MI) adaptive test to perform somewhat more poorly than the
stratified adaptive (SA) or Bayesian (B) tests at 10-item test lengths. Results
for the other dependent measures tended to support the fidelity analysis; that
is, with the exception of Item Set 42, adaptive tests using error-laden item
parameter estimates generally achieved scores with lower levels of inaccuracy,
bias, and RMSE than did conventional tests of the same lengths using error-free
item estimates.

Analyses of the data in terms of dependent measures conditioned on values
of θ (inaccuracy, bias, RMSE, the two information measures, and efficiency)
supported the findings from the overall analysis. When errors occurred in the a,
b, and c parameters separately, there was very little effect on these indices,
and the adaptive tests measured better than the conventional tests at virtually
all levels of θ. There was essentially no measurement degradation as the result
of errors in a and c, with a slightly greater effect for b. For realistic
values of combined error in the three item parameters, some measurement
degradation occurred for the adaptive tests, but they still measured better
than the rectangular conventional test at all θ levels, and better than the
peaked conventional test for about three-fourths of the θ scale. There were few
consistent differences in the performance of the different adaptive testing
strategies.

Additional research in progress. Results of this study indicated very
little effect of errors in item parameter estimates on the measurement
performance of adaptive testing strategies. Since this study was the first to
investigate this question, it was necessarily limited in a number of ways.
Consequently, further simulations are planned that (1) vary the characteristics
of the item pool, in terms of levels of the three item parameters, in order to
determine the generality of the findings across item pools with different
characteristics; (2) allow correlated errors to occur in the item parameter
estimates, since only uncorrelated errors were used in this study; and (3)
examine the effects of error in item parameter estimates separately for one-,
two-, and three-parameter IRT models.

Live-Testing  Comparison  of  Adaptive  Testing  Strategies 

Approach. Three testing strategies (peaked conventional, Bayesian adaptive,
and maximum information adaptive) were compared on the basis of alternate
forms reliability and observed information. The tests were composed of 60
five-choice vocabulary items that were divided into two 30-item alternate
forms. The conventional test was peaked in information values evaluated at
θ = 0.0. Items administered in the maximum information and Bayesian testing
strategies were selected according to their adaptive item selection routines.
There were 373 students in the conventional testing condition, 390 in the
Bayesian testing condition, and 233 in the maximum information testing
condition.

Testing strategy was the major independent variable of interest. Methods
of scoring were also compared. These included logistic maximum likelihood
scoring, Bayesian scoring, and (for the conventional test) proportion-correct
scoring. Test length was a third independent variable of interest. Thirty test
lengths were obtained by scoring each 30-item test at each test length from 1 to
30 items. Testing strategies were compared on the basis of alternate forms
reliability by correlating corresponding ability estimates obtained from Forms A
and B for a given testing strategy.

Since the test data were scored in at least two ways (Bayesian and maximum
likelihood), a total of seven combinations of testing strategy and scoring
method were compared on the basis of alternate forms reliability. Scoring
methods were compared by examining the alternate forms reliabilities of a
single testing strategy scored by more than one method. Three of the alternate
forms reliabilities paired the appropriate scoring method with each of the
three testing strategies. These were proportion-correct scoring of conventional
tests, maximum likelihood scoring of maximum information tests, and Bayesian
scoring of Bayesian-administered tests. The remaining four alternate forms
reliabilities were obtained by scoring the item response data by a scoring
routine other than the appropriate one. In this way, reliabilities were
obtained for the Bayesian scoring of the maximum information test, maximum
likelihood scoring of the Bayesian test, Bayesian scoring of the conventional
test, and maximum likelihood scoring of the conventional test. Reliabilities
were calculated as a function of test length. Scoring method correlations were
obtained by correlating estimates obtained from different scorings of the same
testing strategy. These correlations were used to analyze the similarity of
ability estimates obtained from different scoring methods applied to a single
set of data.
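The calculation of alternate forms reliability as a function of test length can
be illustrated with a short Python sketch. It is a simplified illustration under
stated assumptions, not the analysis performed in the study: the data are
simulated rather than real, both forms are peaked at difficulty 0 under a
one-parameter logistic model, 400 hypothetical examinees are used, and only
proportion-correct scoring is shown (maximum likelihood or Bayesian estimates
would be substituted at the scoring step).

```python
import math
import random

def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def simulate_form(thetas, n_items, rng):
    """Simulate 0/1 responses to a peaked form (all item difficulties 0)."""
    return [[1 if rng.random() < 1.0 / (1.0 + math.exp(-t)) else 0
             for _ in range(n_items)]
            for t in thetas]

def reliability_by_length(form_a, form_b):
    """Correlate Form A and Form B scores after the first k items, k = 1..max."""
    max_len = len(form_a[0])
    result = []
    for k in range(1, max_len + 1):
        score_a = [sum(row[:k]) / k for row in form_a]  # proportion correct
        score_b = [sum(row[:k]) / k for row in form_b]
        result.append(pearson(score_a, score_b))
    return result

rng = random.Random(0)
thetas = [rng.gauss(0.0, 1.0) for _ in range(400)]  # hypothetical examinees
form_a = simulate_form(thetas, 30, rng)
form_b = simulate_form(thetas, 30, rng)

reliabilities = reliability_by_length(form_a, form_b)
for k in (1, 10, 20, 30):
    print(k, round(reliabilities[k - 1], 2))
```

Scoring every test at every length from the same response record, as here,
yields the full reliability-by-length curve from a single administration of
each form.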

The three testing strategies were also compared on the basis of their
errors of measurement. This was assessed in two ways: (1) using estimated errors
of measurement derived from maximum likelihood scoring and (2) using estimated
errors of measurement from Bayesian scoring. In the first method, test item
responses were scored by maximum likelihood methods, and the standard error of
measurement (SEM) associated with each ability estimate was calculated. These
values are the reciprocal of the square root of test information at a given θ
level and estimate the standard deviation of the estimated θ values around the
true θ value; the larger the SEM, the more likely the estimate will be
inaccurate. The posterior variance of the Bayesian ability estimate was the
second index used to compare the testing strategies on the basis of measurement
accuracy. Both the SEM and the posterior variances were examined as a function
of estimated ability level.
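The relation between test information and the SEM can be made concrete with a
brief Python sketch. The item information expression is the standard one for
the three-parameter logistic model; the scaling constant D = 1.7 and the
four-item example test are assumptions made here for illustration and are not
taken from the report.

```python
import math

D = 1.7  # logistic scaling constant (an assumption; not stated in the report)

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """Item information for the three-parameter logistic model at theta."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def sem(theta, items):
    """SEM = reciprocal of the square root of test information at theta."""
    info = sum(item_information(theta, a, b, c) for a, b, c in items)
    return 1.0 / math.sqrt(info)

# Hypothetical peaked test: four identical items (a = 1, b = 0, c = 0).
items = [(1.0, 0.0, 0.0)] * 4

print(round(sem(0.0, items), 3))  # SEM where the test is peaked
print(round(sem(3.0, items), 3))  # SEM far from the peak is much larger
```

The second value illustrates why a test peaked at one ability level measures
the extremes of the ability continuum less precisely.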

Results. A preliminary report of the results of this study is in Johnson
and Weiss (1980). Parallel forms reliabilities of the three testing strategies
showed that after 11 items the peaked conventional test yielded higher
reliabilities than either of the adaptive tests. The greatest difference between
reliabilities was .09, between the adaptive and conventional tests at the
30-item test length; the reliabilities of the adaptive tests were r = .81,
compared with the final reliability of r = .90 for the conventional test. The
conventional test reliability was nearly identical to that of the Bayesian test
up to the 10-item test length, but after that point the conventional test
reliability increased more quickly than that of the adaptive tests. Although
adaptive test reliabilities showed signs of leveling off toward the end of the
test, the reliability of the conventional test appeared to increase steadily.

In comparisons of testing strategies scored by other than optimal scoring
strategies, the Bayesian scoring of the conventional and maximum information
testing strategies yielded higher reliabilities than the maximum likelihood
scoring of the conventional and Bayesian testing strategies. These data
indicate that Bayesian scoring of an adaptive test may yield more stable
estimates of ability than maximum likelihood scoring. The data also illustrate
the inappropriateness of scoring conventional tests with maximum likelihood
scoring methods, since extremely low reliabilities (maximum of r = .75) were
obtained at all test lengths. The correlations between scores on the same
testing strategy scored by different methods showed that the highest
correlations were obtained for Bayesian and proportion-correct scores of the
conventional test, with most correlations between .97 and .99. The second
highest level of correlation was between the Bayesian and
maximum-likelihood-scored maximum information test, with most correlations
between .93 and .95. When the maximum information adaptive test was scored by
the Bayesian scoring method, reliabilities of short adaptive tests were higher
than those of the conventional test, and differences in reliabilities were
smaller at longer test lengths.

On the basis of the reliability data, few conclusions can be drawn about
the relative merits of the three testing strategies. Limitations of the item
pool might account in part for the lowered reliability of the adaptive tests in
comparison to the conventional test, since adaptive tests depend heavily on the
quality of the items in the item pool. The item pool used for the two adaptive
tests had fewer items at the extremes of the ability range, and these items had
relatively lower discrimination parameters. Especially at abilities where there
were fewer items, it is likely that the correlations between ability estimates
would be attenuated and that the adaptive process would be at a disadvantage as
testing progressed. The result would be that toward the end of testing there
would be fewer and fewer items available at a given ability level.

Another factor that limits the comparison of the testing strategies in
terms of alternate forms reliability correlations is the distribution of ability
in the population. Since values of the Pearson product-moment correlations
depend on the distributions of the ability estimates involved, different ability
distributions can result in different levels of correlation. Thus, the
reliability correlations confound the distribution of the ability estimates
with the measurement precision of the testing strategies.
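The confounding described above can be demonstrated with a small simulation.
The Python sketch below is purely illustrative and uses invented values: two
hypothetical groups are measured with exactly the same error standard deviation
(0.5) on both forms, but differ in the spread of true ability (standard
deviations of 1.0 and 0.5). Under these assumptions the expected alternate
forms correlations are about .80 and .50, respectively, even though measurement
precision is identical.

```python
import math
import random

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def alternate_forms_r(sd_theta, sd_error, n, rng):
    """Correlate two error-laden estimates of the same true abilities."""
    thetas = [rng.gauss(0.0, sd_theta) for _ in range(n)]
    form_a = [t + rng.gauss(0.0, sd_error) for t in thetas]
    form_b = [t + rng.gauss(0.0, sd_error) for t in thetas]
    return pearson(form_a, form_b)

rng = random.Random(42)

# Same measurement error in both groups; only the ability spread differs.
r_wide = alternate_forms_r(sd_theta=1.0, sd_error=0.5, n=5000, rng=rng)
r_narrow = alternate_forms_r(sd_theta=0.5, sd_error=0.5, n=5000, rng=rng)

print("wide ability distribution:  ", round(r_wide, 2))
print("narrow ability distribution:", round(r_narrow, 2))
```

Because the error variance is held constant, the difference between the two
correlations reflects only the ability distributions, which is why indices
based on test information are preferable for comparing precision.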

Errors of measurement derived from test information yield comparisons of
testing strategies that are unconfounded by the distribution of the ability
estimates. Comparisons of the testing strategies on the basis of SEMs and
posterior variances showed that at no point on the ability continuum were the
errors of measurement smaller in the conventional test than in the adaptive
tests. In both error of measurement comparisons there was poorer measurement at
the low end of the ability distribution, although both extremes, positive and
negative, were less precisely measured than the center of the ability
continuum.

The results indicate that the two adaptive tests yielded about the same level
of measurement precision as each other and that these levels were greater than
those obtained from the conventional test at all levels of ability. Thus,
adaptive testing strategies yielded scores with greater precision/information
(lower errors of measurement) than did the conventional testing strategy.

Additional research in progress. Since the reliability results of this
study were contrary to expectations and conflicted with other research using a
similar design but different tests (Kingsbury & Weiss, 1980; McBride, 1980), a
fourth test was added to the study. To examine the effects of test difficulty
on the results, this test was a second conventional test in which average item
difficulty was higher than that of the first conventional test. Data were
collected from 530 students on this 60-item conventional test, which consisted
of two embedded 30-item alternate forms. The alternate forms reliabilities
of these tests will be computed at test lengths from 1 to 30 items, and the data
will be further analyzed to permit direct comparisons with the three other
testing strategies.

Future  Research  Plans 

In addition to using item parameters that contain varying degrees of error,
real adaptive testing item pools may deviate from the unidimensional IRT model
that has been applied in all adaptive testing simulations. Since deviations
from unidimensionality (e.g., Bejar, 1979, 1980; Bejar, Weiss, & Kingsbury,
1977; Reckase, 1978) can potentially affect the performance of adaptive testing
strategies, a series of Monte Carlo simulation studies will be constructed
around the degrees and types of dimensionality observed in ability test data.


These studies will consist of the generation of testee responses using the
underlying multidimensional structures observed in ability test items, but
adaptive branching will occur by means of several adaptive testing strategies
designed for unidimensional adaptive testing. Thus, the research question will
be the effects of violation of the unidimensionality assumption on the
performance of adaptive testing strategies. Again, the evaluative criteria will
consist of information, bias, correlation of true ability and ability
estimates, and other characteristics of the ability estimates derived from the
unidimensional adaptive testing strategies.


Item  Formats  and  Response  Modes 

The use of interactive computers to administer ability tests allows the
design and use of test items that do not make use of the typical multiple-choice
item format. Research on alternatives to the typical multiple-choice item
(Bejar, 1975; Vale, 1977) suggests that there is considerable improvement
possible in information utilization from adaptive testing by use of response
modes other than multiple-choice items. Consequently, continued research in this
area was indicated.

Objective 

To evaluate the utility for adaptive testing of a number of response modes
and item formats usable in adaptive testing.

Approach 

This objective was pursued using the six item types shown in Figure 2. The
studies were concerned with the following characteristics of these item types:

1. The relationship of responding in the various formats to ability
   levels.

2. The reliability of test item responses and ability estimates obtained
   in the various formats.

3. Information characteristics of items and tests utilizing the various
   formats.

4. The relationships among test item responses using the various formats.

5. The relative validity of responses obtained from the various formats.

6. The generality of findings obtained from the different response formats
   to different populations and different ability dimensions.

7. The comparative factor structure of tests administered in the various
   formats.

The research was designed to study the characteristics of the six item
formats in several ability areas. It is essentially a search for an item format
that allows testees to express as much knowledge as they have available about a
given question while minimizing the effects of guessing. The results obtained
from this series of studies will be used to select several item formats to be
used in computerized adaptive testing.


Two sets of 30 multiple-choice items were chosen from available item pools.
One set of items was chosen from a pool of analogy items, and the second set was
chosen from a pool of arithmetic reasoning items. Both item pools included item


Figure  2 

Description  of  Response  Formats 


1. Multiple-choice items with conventional response format. These items were
   conventional multiple-choice items with four alternatives. The examinee was
   asked to choose the correct answer.

   Example.  Procedure : Activity
              1. Diplomacy : Tact
             *2. Itinerary : Journey
              3. Minutes : Committee
              4. Index : Book


2. Multiple-choice items with probabilistic response format. These items were
   exactly the same as the conventional multiple-choice items, but the
   examinees were asked to assign 100 points among the four alternatives to
   indicate their confidence in the "correctness" of each alternative.

   Example.  Procedure : Activity          (Possible Answer)
              1. Diplomacy : Tact                   0
              2. Itinerary : Journey               75
              3. Minutes : Committee               25
              4. Index : Book                       0


3. Dichotomous items with a yes-no (dichotomous) response format. For these
   items only the item stem and one alternative were presented. The examinees
   were asked to respond with a "yes" if they thought the alternative provided
   was a correct answer to the question, and "no" if they thought it was not.

   Example.  Q. Procedure : Activity       (Possible Answer)
             A. Index : Book                      Yes


4. Dichotomous items with a probabilistic response format. These items were
   identical to the dichotomous items with a yes-no response format, but the
   examinees were asked to respond with a probability (a number from 0 to 100)
   that reflected their confidence that the alternative provided was a correct
   answer to the question.

   Example.  Q. Procedure : Activity       (Possible Answer)
             A. Index : Book                       10


5. Free-response items with a conventional response format. For these items
   only the item stem was presented. The examinees were asked to provide their
   own answers.

   Example.  Q. Procedure : Activity
                Itinerary : ______


6. Free-response items with a probabilistic response format. Once again only
   the item stem was presented, but this time the examinees were asked to
   provide an answer to the question and to assign a probability (a number from
   0 to 100) to the answer they gave to indicate their confidence in the
   "correctness" of their response.

   Example.  Q. Procedure : Activity       (Possible Answer)
                Itinerary : ______          Trip   90
parameters calculated on large numbers of individuals using the three-parameter
logistic model of LOGIST (Wood & Lord, 1976; Wood, Wingersky, & Lord, 1976).
The items were chosen to represent a uniform range of difficulty and
discrimination parameters. Each set of 30 items was then modified to conform to
the item formats shown in Figure 2. For each item, the item stem remained the
same, while the response formats were changed, resulting in six sets of 30
items, each with a different response format. Tests were then constructed
utilizing these items, and the tests were administered in various combinations
to a number of groups of several hundred college students.

The data collected will be used to determine the factor structure and the
convergent and discriminant validity of the tests using the different response
formats in the two ability domains and will be compared with similar data
already available on vocabulary ability. The data will also be used to compare
the item parameters, item and test information functions, internal consistency,
and interrelationships among scores derived from the various response formats.

Results 


Preliminary data analyses were completed on data collected on three of the
six item formats using the analogies items. The formats analyzed were
multiple-choice items with conventional response format (MCC), multiple-choice
items with probabilistic response format (MCP), and dichotomous items with a
dichotomous response format (DD); these are Types 1 through 3 in Figure 2.

Table 3 shows the average item scores for the 30 items in the three
formats. The MCC and MCP items were rank ordered very similarly, as evidenced
by the correlation of .91 between the average item scores for these two
formats. The DD items ordered themselves somewhat differently, as indicated
by the correlations between the DD average item scores and the MCC and MCP
average item scores, which were .43 and .52, respectively. Although the present
results reflect only one scoring system for the MCP items, several other scoring
methods will also be investigated for this response format.

Table 4 shows validity and internal consistency reliability coefficients
obtained for the three item formats. The validity coefficient reported is the
correlation of total score with the reported grade-point average of the
students (later analyses are planned using actual GPA rather than reported
GPA). The validity coefficients for the three formats were not significantly
different from each other, and all were significantly different from zero. The
MCP items had the highest internal consistency reliability but the lowest
validity. The MCC items had the second highest reliability and the highest
validity, and the DD items had the lowest reliability with moderate validity.
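
For readers unfamiliar with the reliability index in Table 4, the following is
a minimal sketch of coefficient alpha computed from a persons-by-items score
matrix. The function names and the small data set are hypothetical; the
report's own response data are not reproduced here.

```python
def variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def coefficient_alpha(scores):
    """Coefficient alpha for a list of persons, each a list of item scores."""
    k = len(scores[0])  # number of items
    item_vars = [variance([person[i] for person in scores]) for i in range(k)]
    total_var = variance([sum(person) for person in scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical dichotomous scores for five persons on four items.
    data = [[1, 1, 1, 0],
            [1, 0, 1, 0],
            [0, 0, 0, 0],
            [1, 1, 1, 1],
            [0, 1, 0, 0]]
    print(round(coefficient_alpha(data), 3))
```

Because alpha depends only on item and total-score variances, the same function
applies to the graded MCP scores as well as to dichotomous MCC and DD scores.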

A number of factor analyses were performed on the data from the three item
formats. Both principal axes and confirmatory analyses were used. The
principal axes analyses showed that two orthogonal factors were extracted for
each response format, and the pattern of positive and negative loadings was
extremely similar for the MCC and MCP items but was very different for the DD
items. In addition, the percent of total score variation accounted for by the
two-factor solution varied with item format.


Confirmatory factor analysis was performed using the principal axes factor
solution for the MCC items as the basis for the model. The data showed better
fit to the model for the MCP items than for the DD items, with the second
factor being rather inconsequential for all three response formats.

Table 3

Average Item Scores for the Same 30 Analogies Items in
Multiple-Choice Conventional Format (MCC),
Multiple-Choice Probabilistic Format (MCP),
and Dichotomous-Dichotomous (DD) Format

  Item            Response Format
 Number       MCC*      MCP**      DD*

    1         .86       1.18       .87
    2         .64        .66       .42
    3         .78        .99       .65
    4         .52        .58       .81
    5         .73       1.02       .80
    6         .79       1.08       .52
    7         .28        .24       .64
    8         .48        .64       .61
    9         .61        .92       .51
   10         .58        .80       .73
   11         .60        .73       .44
   12         .89       1.19       .71
   13         .77       1.08       .79
   14         .62        .77       .32
   15         .75       1.29       .82
   16         .78       1.07       .92
   17         .64        .94       .70
   18         .84       1.48       .87
   19         .41        .63       .80
   20         .68       1.05       .84
   21         .87       1.14       .98
   22         .84       1.06       .65
   23         .87       1.32       .78
   24         .91       1.41       .98
   25         .68       1.08       .56
   26         .77        .88       .80
   27         .75       1.25       .83
   28         .86       1.29       .94
   29         .54        .60       .58
   30         .35        .42       .66

*Proportion correct.
**Average score with range from 2.00 to -1.00.

Thus, the results obtained so far advise against the use of the DD item
format. This item format is less reliable than the other response formats, has
only moderate validity, shows high levels of guessing, and does not appear to
be consistently measuring analogies ability. The DD response format does not
appear to be a viable alternative to the multiple-choice item. On the other
hand, the MCP item format does appear to be a promising alternative to the
traditional multiple-choice item. It is more reliable, nearly as valid, and
appears to be measuring analogies ability to a greater degree than the
conventional multiple-choice items.

Table 4

Alpha Internal Consistency Reliability Coefficients
and Validity Correlations with Reported GPA
for Three Response Formats

                                          Coefficient     Validity
Response Format                     N        Alpha       Coefficient

Multiple-Choice Conventional       486        .85            .23
Multiple-Choice Probabilistic      299        .91            .17
Dichotomous-Dichotomous            303        .59            .20
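
The statements about the validity coefficients in Table 4 can be checked
approximately with standard large-sample tests. The sketch below is an
illustration only: it assumes the three formats were administered to
independent groups (the differing Ns suggest this, but the report does not say
so), uses Fisher's z transformation for the difference between two
correlations, and uses the usual t statistic for a single correlation. The
report does not state which tests were actually applied, and the function names
are hypothetical.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


def t_for_correlation(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


if __name__ == "__main__":
    # (validity coefficient, N) taken from Table 4.
    mcc, mcp, dd = (.23, 486), (.17, 299), (.20, 303)
    print("MCC vs MCP z:", round(fisher_z_difference(*mcc, *mcp), 2))
    print("MCC vs DD  z:", round(fisher_z_difference(*mcc, *dd), 2))
    print("MCP t vs 0:  ", round(t_for_correlation(*mcp), 2))
```

Under these assumptions the largest difference (MCC versus MCP) gives a z of
about 0.85, well below the conventional 1.96 cutoff, while even the smallest
coefficient (.17 with N = 299) gives a t of nearly 3, which agrees with the
conclusions stated in the text.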

Additional Research in Progress


Further research and analysis, of course, are needed to determine whether
or not these results are generalizable across ability domains, and whether any
of the item formats not yet analyzed will also show promise as alternatives to
the standard multiple-choice item. Considerable amounts of additional data
will be obtained on these item formats using both the analogies and numerical
reasoning items, and the resulting data will be analyzed to investigate the
questions raised earlier. In addition, further evidence of the generality of
the findings will be sought using vocabulary ability items that are being
analyzed by similar methodologies. Once remaining item types are identified as
replacements for the multiple-choice item, adaptive tests using these item
types will be designed and their characteristics investigated.

Fit of Individuals to Item Response Theory Models

Previous research on the person response curve (Trabin & Weiss, 1979) and
related research on person fit (Levine & Drasgow, 1980; Levine & Rubin, 1979)
promises the capability of identifying individuals who, on a given test, are
not responding in accordance with a given IRT model. This lack of fit to the
model can derive from a number of possible causes, including the following:

1. Lack of motivation to respond appropriately to the test;

2. Inappropriate or nonrandom guessing, or lack of guessing;

3. Responses that are not in accord with the unidimensionality assumption
of ICC theory.

Knowledge that an individual is not responding according to the model for any
of these reasons would be useful in applied situations, since it suggests that
the scores of that person should be carefully considered before decisions are
made on the basis of those scores. It may also be an important moderator
variable for use in prediction studies.
Objective


To further investigate the utility of the person response curve (PRC)
concept and to identify the correlates of deviations of an individual from the
IRT model.

Approach 

To properly investigate the usefulness of the PRC approach to the problem
of person fit, its characteristics were investigated within the context of
other indices designed for similar purposes. Based on a thorough review of the
relatively small literature on person fit (also known as "appropriateness"
measurement), the following indices were identified:

1. Trabin and Weiss's (1979) chi-square index of the fit of observed and
expected PRCs.

2. Reckase's (1977) mean square deviation (MSD) index averaged over items,

\[ \mathrm{MSD}_j = \frac{1}{N} \sum_{i=1}^{N} (u_{ij} - P_{ij})^2 \]

where
$\mathrm{MSD}_j$ = the mean squared deviation for person j,
$u_{ij}$ = the actual response to item i by person j,
$P_{ij}$ = the probability of a correct response as predicted by the
three-parameter IRT model, and
$N$ = the number of items in the test.

This index is a special case of the fit of the observed PRC
to the expected PRC.

3. A variation of Reckase's index in which only improbable responses are
scored. It is difficult to argue that those responses that are in the
predicted direction should be included in a person-fit statistic. Hence, a
squared deviation $(u_{ij} - P_{ij})^2$ is included in the statistic only
where it is greater than .25. The divisor of the statistic is still the
number of items in the test.

4. The likelihood ratio index of person fit. For a given θ value, a
probability distribution for the possible response vectors can be generated.
The probability of an individual's actual response vector is divided by the
probability of the most likely response vector to produce the likelihood
ratio.

5. Wright (1968) proposed an index of person or item fit, which is the sum
of the standardized squares of the residual after fitting the model. For the
Rasch model the standardized squared residual is $e^{(\theta - b)}$ for an
incorrect answer and $e^{(b - \theta)}$ for a correct answer, where e = 2.71,
θ is the ability of the individual, and b is the difficulty of the item.


6. Response pattern information, which reflects the flatness of the
likelihood function.

7. The three appropriateness indices described by Levine and Drasgow (1980)
and Levine and Rubin (1979).

8. The difference between the difficulties of the easiest item answered
incorrectly and the most difficult item answered correctly.

9. Donlon and Fisher's (1968) "personal biserial correlation" and Jacobs'
(1963) variation of it.

10.  The  posterior  variance  of  IRT-based  Bayesian  ability  estimates. 
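
Several of the indices listed above are simple enough to sketch directly. The
code below is a minimal illustration, not the project's software: it covers the
MSD index (2), its improbable-response variant (3), the likelihood ratio (4),
the Rasch form of Wright's index (5), and the difficulty-gap index (8). The
function names, the logistic scaling constant D = 1.7, the sign convention of
the gap index, and the three example items are all assumptions.

```python
import math

D = 1.7  # assumed logistic scaling constant; the report does not give one


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def msd(responses, probs):
    """Index 2: mean squared deviation of responses from model probabilities."""
    return sum((u - p) ** 2 for u, p in zip(responses, probs)) / len(responses)


def msd_improbable(responses, probs):
    """Index 3: only squared deviations above .25 are summed; divisor is N."""
    total = sum((u - p) ** 2 for u, p in zip(responses, probs)
                if (u - p) ** 2 > 0.25)
    return total / len(responses)


def likelihood_ratio(responses, probs):
    """Index 4: P(observed vector) / P(most likely vector) at a given theta."""
    observed = math.prod(p if u == 1 else 1.0 - p
                         for u, p in zip(responses, probs))
    most_likely = math.prod(max(p, 1.0 - p) for p in probs)
    return observed / most_likely


def wright_rasch(responses, theta, difficulties):
    """Index 5 (Rasch case): sum of standardized squared residuals."""
    return sum(math.exp(b - theta) if u == 1 else math.exp(theta - b)
               for u, b in zip(responses, difficulties))


def difficulty_gap(responses, difficulties):
    """Index 8: easiest item missed minus hardest item passed (sign assumed)."""
    missed = [b for u, b in zip(responses, difficulties) if u == 0]
    passed = [b for u, b in zip(responses, difficulties) if u == 1]
    if not missed or not passed:
        return None  # undefined for all-correct or all-incorrect vectors
    return min(missed) - max(passed)


if __name__ == "__main__":
    items = [(1.0, -2.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]  # (a, b, c)
    theta = 0.0
    probs = [p_correct(theta, a, b, c) for a, b, c in items]
    for label, vector in (("fitting", [1, 1, 0]), ("aberrant", [0, 1, 1])):
        print(label, round(msd(vector, probs), 3),
              round(msd_improbable(vector, probs), 3),
              round(likelihood_ratio(vector, probs), 4))
```

In this toy example the aberrant vector (easy item missed, hard item passed)
yields a larger MSD and a much smaller likelihood ratio than the
model-conforming vector, which is the behavior the indices are meant to show.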

The utility of these indices for identifying individual nonfit to IRT
models is first being studied in simulation. As a first step, the null
distributions of these person-fit statistics are being examined, to serve as
reference points for the later studies to determine how well each statistic
identifies person nonfit. A set of 2,500 simulees was rectangularly
distributed at 25 equally spaced levels of θ, and 625 items were administered
to each simulee. Items were rectangularly distributed in difficulty (b), with
25 items at each of 25 levels of b corresponding to the 25 levels of θ.
Within each level of b, items varied in discrimination (a) at .07 intervals.
Simulation data were generated separately for the 2,500 simulees for guessing
parameters of c = 0.0, .20, and .25 in order to examine the effects of
guessing on the null distributions. To examine the effects of test length and
discrimination, shorter tests were selected from the item pool at differing
levels of a. For each of these test configurations, null distributions were
computed for each of the person-fit indices.
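
The simulation design just described can be sketched as follows. This is an
illustration under stated assumptions, not the project's code: the θ range of
-3 to +3, the lowest discrimination value of .30, the scaling constant
D = 1.7, and all function names are hypothetical; only the counts (25 levels,
625 items, 2,500 simulees), the .07 discrimination interval, and the three
values of c come from the text.

```python
import math
import random

D = 1.7  # assumed logistic scaling constant


def levels(low, high, n):
    """n equally spaced values from low to high inclusive."""
    step = (high - low) / (n - 1)
    return [low + k * step for k in range(n)]


THETA_LEVELS = levels(-3.0, 3.0, 25)             # assumed range
B_LEVELS = THETA_LEVELS                          # difficulties match theta levels
A_LEVELS = [0.30 + 0.07 * k for k in range(25)]  # assumed lowest value of .30


def build_pool(c):
    """625 items: 25 discriminations at each of 25 difficulty levels."""
    return [(a, b, c) for b in B_LEVELS for a in A_LEVELS]


def build_simulees(per_level=100):
    """2,500 simulees: 100 at each of the 25 theta levels."""
    return [theta for theta in THETA_LEVELS for _ in range(per_level)]


def simulate_responses(theta, pool, rng):
    """Model-conforming 0/1 responses of one simulee to every item in the pool."""
    responses = []
    for a, b, c in pool:
        p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
        responses.append(1 if rng.random() < p else 0)
    return responses


if __name__ == "__main__":
    rng = random.Random(0)
    simulees = build_simulees()
    for c in (0.0, 0.20, 0.25):
        pool = build_pool(c)
        vector = simulate_responses(simulees[0], pool, rng)
        print(c, len(pool), sum(vector))
```

Shorter tests for the test-length and discrimination conditions could then be
drawn by filtering the pool on its a values before simulating responses.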

After distributions of these indices are known for model-conforming data,
increasing amounts of random responses will be added to the response vectors
to simulate random responding, inattention or low motivation, and guessing.
Changes in the distributions that result from increasingly random responding
will be noted. In this phase, percent of random response is an additional
independent variable. Those person-fit statistics that are affected most
strongly by random responding will be retained as good candidates for further
live-data research. Results of the data analysis are not yet available from
these studies.
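
One way the random-response manipulation could be implemented is shown below.
This is a sketch under assumptions: the report does not say how positions are
chosen or what chance level is used, so the function name, the random choice
of positions, and the p_chance parameter are all hypothetical.

```python
import random


def add_random_responses(responses, proportion, p_chance, rng):
    """Replace a given proportion of responses with chance-level responses.

    proportion: fraction of items answered at random (the independent variable)
    p_chance:   assumed probability that a random response is correct
    """
    n_random = round(proportion * len(responses))
    positions = rng.sample(range(len(responses)), n_random)
    contaminated = list(responses)
    for i in positions:
        contaminated[i] = 1 if rng.random() < p_chance else 0
    return contaminated


if __name__ == "__main__":
    rng = random.Random(1)
    clean = [1] * 10 + [0] * 10
    for proportion in (0.0, 0.25, 0.50):
        print(proportion, add_random_responses(clean, proportion, 0.20, rng))
```

Each person-fit index can then be recomputed on the contaminated vectors and
compared against its null distribution at each level of proportion.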

Additional  Research  Plans 

The effect of multidimensionality on the person-fit indices will be
examined. This will be studied by generating response data from a second θ
known to correlate to varying degrees with the original θ and by inserting
these responses into the response vectors. Degree of correlation as well as
number of dimensions will be manipulated. The dependent variable will be the
ability of the person-fit indices to identify the multidimensional response
patterns.
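
A second ability with a chosen correlation to the first can be generated by
mixing the standardized original θ with independent normal noise. The sketch
below shows one standard construction; it is not taken from the report, and
the function names and the splice step are hypothetical.

```python
import math
import random


def correlated_thetas(thetas, r, rng):
    """Standardized second abilities whose population correlation with thetas is r."""
    n = len(thetas)
    mean = sum(thetas) / n
    sd = math.sqrt(sum((t - mean) ** 2 for t in thetas) / n)
    noise_weight = math.sqrt(1.0 - r * r)
    return [r * (t - mean) / sd + noise_weight * rng.gauss(0.0, 1.0)
            for t in thetas]


def splice(primary, secondary, item_indices):
    """Insert second-dimension responses into a primary response vector."""
    mixed = list(primary)
    for i in item_indices:
        mixed[i] = secondary[i]
    return mixed


if __name__ == "__main__":
    rng = random.Random(2)
    thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
    print(correlated_thetas(thetas, 0.5, rng))
    print(splice([1, 1, 1, 1], [0, 0, 0, 0], [1, 3]))
```

Responses to the selected items would be generated from the second ability and
spliced in, with r and the number of spliced items as the manipulated factors.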

Once promising person-fit indices are identified in the simulation studies,
live-testing studies will be designed in which the fit of persons to the IRT
model is empirically tested on a given pool of items. Experimental studies
will be designed to attempt to induce deviations from the model in groups of
individuals and to observe whether the person-fit indices are sufficiently
sensitive to identify those deviations when they exist. In addition, the
existence of naturally occurring groups in which deviation from the IRT model
might occur will be studied. For example, it can be hypothesized that on
certain kinds of subtests (e.g., verbal ability tests) students from a
non-English speaking culture would likely show significant deviations from
unidimensionality. It could also be hypothesized that students who are not
"test-wise" or who lack familiarity with multiple-choice tests would show
specific deviations from IRT models in their test performance. In addition,
the generality of person-fit variables across ability dimensions will be
studied in order to determine whether deviations of fit to the model for an
individual are specific to an item pool or occur across different kinds of
item domains.

New  Types  of  Ability  Tests 


Psychometric attempts to measure individual differences in cognitive
abilities during the last 60 years have produced tests of global abilities
such as general reasoning, verbal and quantitative abilities, as well as more
specific ability "factors" such as speed and flexibility of closure, spatial
orientation, and word fluency (French, Ekstrom, & Price, 1963). Research in
adaptive testing has shown that the precision and validity of ability tests
can be improved by adaptive testing procedures. At the same time, cognitive
psychologists have developed several standard tasks or paradigms that have
been used to study the mechanisms and structure of aspects of memory,
attention, and cognition (e.g., Sperling, 1960; Sternberg, 1966). Attempts to
assess quantitative differences between individuals in such
information-processing abilities and to relate them to more traditional
psychometric measures are a fairly recent phenomenon (Chiang & Atkinson, 1976;
Day, 1977; Hunt, Frost, & Lunneborg, 1973; Hunt, Lunneborg, & Lewis, 1975;
Lunneborg, 1977; Rose, 1974). Carroll (1976) has provided an additional
framework for relating traditional psychometric factors to their cognitive
information-processing requirements. Some of this research has suggested that
individual differences in such information-processing abilities can be
measured reliably by interactive computers and that these abilities may add
incremental validity in predicting external job criteria (Cory, 1977; Cory,
Rimland, & Bryson, 1977).

In the past, tasks of the type that have been used by cognitive
psychologists to measure information-processing abilities and by
psychometricians to measure perceptual and spatial factors have been
administered as blocks of fixed numbers of trials or replications. As a
result, little is known about employing the important parameters of these
tasks for adapting the difficulty level of replications to converge upon an
individual's ability level. A major emphasis of the research, therefore,
involved (1) studying some of these tasks from the point of view of how
computerized adaptive administration could be meaningfully achieved and
(2) evaluating the measurement benefits of adaptive administration of ability
tests designed to utilize the unique capabilities of computerized
administration.

Objective 

The objectives were to investigate the application of adaptive testing
techniques to improving the measurement characteristics of several cognitive
information-processing tasks (e.g., short-term memory span, capacity of visual
sensory memory). Computerized administration of these ability tests will be
investigated as a means of modifying task presentations over time in a way
that would not be possible in paper-and-pencil testing.
Approach

Two types of ability test items that utilized the unique capabilities of
computer administration (Memory for Patterns and Digit Span) were studied to
investigate the feasibility of adaptive administration of
information-processing tests. These tasks were studied from the point of view
of enabling the computer to adapt the difficulty of the tasks presented to the
ability level of the testee during the process of testing.

Memory for Patterns. This test was based on a procedure devised by
Sperling (1960) for determining the capacity of visual sensory memory. The
procedure is based on presentation of arrays of letters that must be studied
by the testee. The procedure will be modified to adapt to a testee's recall
on previous screen presentations in order to obtain precise quantitative
estimates of individual differences in visual sensory memory capacity with
fewer replications. For example, the size of the array (and thus the memory
capacity demands) can be made larger or smaller based on an individual's
previous performance; and/or the duration of the array presentation could be
lengthened or shortened to adapt the task difficulty to the testee's ability
level during the course of testing. Prior to implementing such an adaptive
approach, however, psychometric research is needed to determine how much of an
increase or decrease in stimulus array size constitutes a meaningful increment
or decrement in difficulty and what ranges of array size are needed to
adequately span differences existing in various populations of interest.
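
The kind of adaptation described above can be illustrated with a simple
up-down rule on array size. This is only a sketch: the step size of one letter
is hypothetical (the text notes that the meaningful step size is itself a
research question), the function names are illustrative, and the bounds of 3
and 10 letters are borrowed from the experimental items described in this
section.

```python
MIN_SIZE, MAX_SIZE = 3, 10  # array sizes used in the experimental items


def next_array_size(current, was_correct, step=1):
    """Raise the array size after a correct recall, lower it after an error."""
    proposed = current + step if was_correct else current - step
    return max(MIN_SIZE, min(MAX_SIZE, proposed))


def run_staircase(start, outcomes, step=1):
    """Sequence of array sizes presented, given a list of recall outcomes."""
    sizes = [start]
    for was_correct in outcomes:
        sizes.append(next_array_size(sizes[-1], was_correct, step))
    return sizes


if __name__ == "__main__":
    # Hypothetical testee: correct on the first three items, then alternating.
    print(run_staircase(5, [True, True, True, False, True, False]))
    # prints [5, 6, 7, 8, 7, 8, 7]
```

The same rule could be applied to presentation time instead of array size by
shortening the exposure after a correct recall and lengthening it after an
error.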

Data collection on two tests of short-term spatial and perceptual memory
was therefore designed to allow a preliminary evaluation of potential adaptive
testing parameters. The experimental Memory for Patterns items consisted of
bounded two-dimensional arrays containing 3 to 10 letters. Each item
consisted of a pair of successive screen presentations. The stimulus display
was presented for a brief timed period and then was erased from the cathode
ray terminal (CRT) and replaced with the recall display. The recall display
contained a bounded letter pattern that was identical to the first pattern
(the stimulus display) except that one or two letters had changed position.
The recall display was untimed and accepted the student's response, indicating
which letter(s) she/he thought had moved. Figure 3 shows a sample Memory for
Patterns stimulus and response display, constituting one test item. Data were
also collected on a related set of items designed to measure Space Memory,
which were similar to the Memory for Patterns items except that the letter
patterns presented were unbounded and thus spread about the entire CRT screen.

In the initial data collection for the Memory for Patterns and Space Memory
items, subtests varied in the number of letters that could change position
from the first pattern to the second. Three 10-item subtests were
administered to each student in a number of different experimental groups.
The first subtest was composed of patterns in which one letter changed
position, the second subtest was composed of patterns in which two letters
changed position, and the third subtest included patterns in which either one
or two letters changed position. Students were assigned to one of four
conditions that varied the order in which the items were presented and the
presentation time, in seconds, of the first pattern of each item pair. In the
Memory for Patterns test the two presentation times were 5 and 10 seconds, and
in the Space Memory test they were 7 and 12 seconds. The two orders in which
items were presented were (1) from lowest to highest difficulty and (2) from
moderate to highest to lowest difficulty, where difficulty was indexed by the
number of letters in the pattern. Reactions to the tests were also obtained
from each student.


Figure 3

A Sample Memory for Patterns Item, with Stimulus Display
on the Left and Recall Display on the Right

Instruction shown on the recall display:
Type the letter which has changed
position, or a question mark (?).
Then press "RETURN".

The analysis of the data was directed at scaling the difficulty of items
and at studying the effects of pattern density and presentation times on item
difficulty indices. A comparison of the item order conditions was made to
examine the effects of practice and proactive inhibition on indices of item
difficulty. A comparison of presentation time conditions was directed at
determining reasonable item exposure times and at investigating the
possibility of using presentation time along with pattern density and number
of pattern changes as adaptive parameters for future adaptive testing.

The results of preliminary analyses of the Memory for Patterns items were
used to design new items and to modify the experimental conditions under which
they were administered. The major conclusion from the preliminary analysis
was that item difficulty was more a function of type of pattern configuration
than of either rate or order condition. For this reason, types of Memory for
Patterns items were hypothesized in a systematic manner. Eight Memory for
Patterns item types, which can be separated into two groups, were developed.
One group of items is composed of patterns taking a geometric form, such as a
line, triangle, square, or pentagon. Four item types of this nature were
developed:

1. An item in which a geometric form was changed to a nongeometric form
through a small move,

2. An item in which a geometric form was changed to a nongeometric form
through a large move,

3. An item in which a nongeometric form was changed into a geometric form
through a small move, and

4. An item in which a nongeometric form was changed into a geometric form
through a large move.

Within the four item types, there were various degrees of nongeometric form
that the patterns could have.

A second group of items was composed of nongeometric forms, but the
patterns varied in terms of pattern configuration definition. For example,
some patterns, although not geometric in form, are better defined and thus
easier to remember. The four item types were as follows:

1. Well-defined pattern configuration with a small change in letter
configuration,

2. Well-defined pattern configuration with a large change in letter
configuration,

3. Poorly defined pattern configuration with a small change in letter
position, and

4. Poorly defined pattern configuration with a large change in letter
position.

Hypotheses were made with regard to item type and resultant item difficulty.
Ninety new items were written based on these eight basic Memory for Patterns
item types.

Of the 90 items, 42 were administered under various experimental
conditions. Students were assigned sequentially to one of 15 testing
conditions. The 42 Memory for Patterns items were presented under three order
and five rate conditions. Pattern density, or the number of letters in a
configuration, took the values 3, 4, 5, 6, 7, 8, 9, and 10. The 42 items were
presented in three orders: (1) ordered from 3 to 10 letters in a pattern;
(2) ordered from 6 to 10 letters, then 3 to 5 letters in a pattern; and
(3) ordered from 8 to 10, then 3 to 7 letters in a pattern. The five rate
conditions were 3, 5, 7, 10, and 13 seconds in duration. Analyses of these
data will be oriented toward the identification of adaptive parameters for the
items and the effects of administration conditions (e.g., sequence effects) on
the adaptive parameters.
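
The 15 testing conditions are the crossing of the three orders with the five
rates. The sketch below enumerates them and shows one reading of "assigned
sequentially", namely round-robin assignment by order of arrival; that reading,
the function names, and the item representation are assumptions.

```python
from itertools import product

# Pattern densities (letters per pattern) in presentation order, per condition.
ORDERS = {
    1: list(range(3, 11)),                      # 3 to 10
    2: list(range(6, 11)) + list(range(3, 6)),  # 6 to 10, then 3 to 5
    3: list(range(8, 11)) + list(range(3, 8)),  # 8 to 10, then 3 to 7
}
RATES = [3, 5, 7, 10, 13]  # presentation times in seconds

CONDITIONS = list(product(ORDERS, RATES))  # 15 (order, rate) pairs


def assign_condition(student_index):
    """Round-robin assignment by arrival order (an assumed interpretation)."""
    return CONDITIONS[student_index % len(CONDITIONS)]


def order_items(items, order_id):
    """Sort items (dicts with a 'letters' key) into the given density order."""
    rank = {density: pos for pos, density in enumerate(ORDERS[order_id])}
    return sorted(items, key=lambda item: rank[item["letters"]])


if __name__ == "__main__":
    print(len(CONDITIONS), assign_condition(0), assign_condition(16))
    sample = [{"letters": n} for n in (3, 10, 6, 8)]
    print([item["letters"] for item in order_items(sample, 2)])
```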

Digit Span. Contrasting with the Memory for Patterns tests, which tap both
spatial abilities and short-term memory capacity, the Digit Span test is
primarily a test of short-term memory. In this test a series of numbers is
presented in rapid succession. The respondent is asked to recall the numbers
in the order they were presented by typing them into the computer terminal in
a serial string. Since the time interval between presenting the numbers,
clearing the screen, and asking for a response was very short, the test was
essentially an indicator of short-term memory capacity.

To identify possible adaptive parameters for this type of test,
presentation rate was varied experimentally so that items were presented at
one of three rates (.2 seconds, .3 seconds, and .5 seconds), corresponding to
fast, moderate, and slow rates. Series length (the number of stimulus values
to be recalled), a second potential adaptive testing parameter, varied from 4
to 10 stimuli. Six items at each series length were administered to yield a
total of 42 digit series in the test. Length of digit series and presentation
rate will be investigated as possible test and item parameters.
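
The structure of the Digit Span test (7 series lengths with 6 items each, for
42 series) can be sketched as below. How the three presentation rates were
distributed over items is not stated in the report; the sketch assumes two
items per rate at each length purely for illustration, and the function names
and all-or-nothing scoring rule are likewise hypothetical.

```python
import random

LENGTHS = range(4, 11)   # series lengths of 4 to 10 digits
RATES = [0.2, 0.3, 0.5]  # seconds per digit: fast, moderate, slow
ITEMS_PER_LENGTH = 6


def build_digit_span_test(rng):
    """Generate 42 random digit series, six at each series length."""
    test = []
    for length in LENGTHS:
        for k in range(ITEMS_PER_LENGTH):
            digits = "".join(str(rng.randrange(10)) for _ in range(length))
            rate = RATES[k % len(RATES)]  # assumed: two items per rate
            test.append({"digits": digits, "rate": rate})
    return test


def score(item, typed):
    """1 if the typed string reproduces the series in order, else 0 (assumed rule)."""
    return 1 if typed.strip() == item["digits"] else 0


if __name__ == "__main__":
    test = build_digit_span_test(random.Random(0))
    print(len(test), test[0], test[-1])
```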

The items were presented in three different orders: (1) from easy to
difficult (where difficulty was defined in terms of numbers of stimulus values
to recall), (2) from moderately difficult to difficult to easy, and (3) from
difficult to easy to moderately difficult. The order of item presentation
will be analyzed to determine if there are practice or inhibitive effects from
one item to the next, since such effects would be undesirable in adaptive test
administration, and to determine the effects of series length and display time
on item difficulties.


Analyses  and  Future  Plans 

Data analyses and future data collection will be oriented toward
identifying potential adaptive parameters in terms of factors influencing the
difficulty of test items. In addition, the influence of undesirable factors
(such as sequence effects, inhibitive effects, and other factors that would
interfere with adaptive administration) will be investigated in order to
permit the design and evaluation of adaptive tests of information-processing
abilities.

Other measures of more traditional psychometric factors, such as
flexibility of closure, may also benefit from application of a computerized
adaptive testing framework. A common measure of this construct has been
variants of the Hidden or Embedded Figures test. The unique capabilities of
computerized administration may allow increased validity to be achieved by
inducing movement into either the stem figure and/or the alternative response
figures. For example, the testee may be required to selectively attend to and
to articulate the stem figure in a more complex figure, which not only
contains distracting lines but also grows, shrinks, translates, and/or rotates
over time. Adaptive administration can be achieved by modifying the amount of
"noise" in the figure and by dynamically adapting the amount and speed of
movement in the complex figure. Several psychometric questions can be
studied. For example, what size increments in "noise" and movement will allow
the computer to most efficiently converge upon the testee's ability level?
Are amounts of "noise" and movement independent dimensions of difficulty to be
manipulated? How is performance under varying degrees of noise and movement
to be scored?

The design and implementation of adaptive versions of these
information-processing ability tests raise a host of new questions and
problems to be investigated. Beyond the identification of adaptive parameters
for these kinds of items, and ruling out extraneous factors such as sequence
effects (which can reduce the effectiveness of the adaptive procedures), new
questions will need to be addressed with regard to the design and scoring of
the adaptive tests. Design questions will include the identification of the
functions relating display time to item difficulty and identification of the
procedures for using this continuous function in adapting display time on each
item to each individual's test performance on an item-by-item basis. Since it
may be observed that this function is different for items of different
difficulties based on stimulus characteristics (e.g., pattern density, length
of span string), adaptive testing procedures will need to be developed that
will jointly take into account the combination of discrete and continuous
difficulty factors. New scoring procedures may also need to be designed for
these kinds of tests, since each testee will receive a set of items selected
to match his/her ability levels; and/or the applicability of IRT-based scoring
procedures will need to be investigated.

Finally, comparisons of computerized adaptive, computerized nonadaptive, and
traditional measures of these abilities should be made to determine if more
precise and efficient measurement can be achieved through adaptive
administration. Where appropriate external criteria are available, predictive
validity comparisons should also be made. If the findings from traditional
ability testing generalize to these kinds of ability tests, the resulting
tests will be shorter, more precise, and more valid and will permit more
meaningful measurement of the range of human abilities.